insert transposomes into the metagenomic DNA. The tagmentation is followed by PCR with index primers, which enables amplification and indexing (barcoding) of the sample libraries to allow multiplexing. Library preparation is followed by sequencing, which produces the raw data as FASTQ files. The steps of read quality assessment and processing are, to some extent, similar to those discussed in Chapter 1. The purpose of the quality control is to reduce sequence biases and artifacts by removing sequencing adaptors, trimming low-quality ends of reads, and removing duplicate reads. If the DNA was extracted from a clinical sample, an additional quality control step is required to remove contaminating host DNA or other non-target sequences. If we need to perform between-sample differential diversity analysis, we may also need to draw a random subsample of reads from the original sample to normalize read counts.
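As a minimal illustration of this subsampling step, the following Python sketch draws a fixed number of reads at random from an uncompressed, single-end FASTQ file. The function name and file names are hypothetical, the whole file is held in memory for simplicity, and in practice a dedicated tool (e.g., seqtk) with paired-end-aware handling would normally be used.

import random

def subsample_fastq(in_path, out_path, n_reads, seed=1):
    """Randomly draw n_reads records from an uncompressed, single-end FASTQ
    file (each record spans four consecutive lines) and write them out."""
    random.seed(seed)
    records = []
    with open(in_path) as fh:
        while True:
            block = [fh.readline() for _ in range(4)]
            if not block[0]:            # end of file reached
                break
            records.append(block)
    chosen = random.sample(records, min(n_reads, len(records)))
    with open(out_path, "w") as out:
        for block in chosen:
            out.writelines(block)

# Hypothetical usage: normalize every sample to the same read count
# subsample_fastq("sampleA.qc.fastq", "sampleA.sub.fastq", n_reads=1_000_000)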
After quality control, two strategies can be followed for the metagenomic raw data. The first is to assemble the metagenomes with a de novo genome assembly method, and the second is an assembly-free approach similar to the amplicon-based method. Each of these strategies may address different kinds of questions. The types and algorithms of de novo assembly were discussed in Chapter 3. However, shotgun metagenomics introduces a new step, called metagenomic binning, which aims to separate the assembled sequences by species so that the assembled contigs in a metagenomic sample are assigned to different bins in FASTA files. Ideally, a bin will correspond to only one genome. A genome built with the binning process is called a Metagenome-Assembled Genome (MAG). Binning algorithms perform binning in several ways: some use taxonomic assignment, while others use properties of the contigs such as GC content, nucleotide composition, or abundance. Binning
algorithms use two approaches for assigning contigs to species: supervised machine learn-
ing and unsupervised machine learning. Both approaches use similarity scores to assign
a contig to a bin. Because many microbial species have not yet been sequenced, some reads will not map to any reference genome, so it is good practice not to rely solely on mapping to reference genomes. Binning based on the nucleotide composition of a contig has been found useful in separating genomes into possible species. The nucleotide composition of a contig is the frequency of k-mers in the contig, where k can be any reasonable integer (e.g., 3, 4, 5, …). It has been found that the genomes of different microbial species have different k-mer frequency profiles that may discriminate the genomes into potential taxonomic groups.
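To make the idea concrete, the short Python sketch below computes the k-mer frequency vector of a single contig (tetranucleotides by default). The function name is ours, windows containing ambiguous bases are simply skipped, and reverse-complement k-mers are counted separately here, whereas some tools merge them.

from itertools import product

def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every possible k-mer (A/C/G/T only)
    in a contig, as a dictionary keyed by k-mer."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:   # windows with N or other ambiguity codes are skipped
            counts[kmer] += 1
            total += 1
    return {km: (c / total if total else 0.0) for km, c in counts.items()}

# Toy example: tetranucleotide frequency (TNF) profile of a short contig
profile = kmer_frequencies("ATGCGCGTATATGCGCATGCATGCGTTA", k=4)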
Machine learning algorithms such as naïve Bayes are used for this taxonomic group assignment. However, features more powerful than sequence composition alone are often required to deal with the complexity of contig sequences.
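As a toy illustration of the supervised route, the sketch below trains scikit-learn's multinomial naïve Bayes classifier on tetranucleotide counts of contigs drawn from labeled reference genomes and then assigns an unlabeled contig. The sequences and labels are invented for illustration only; no specific binning tool is implemented this way.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training contigs from reference genomes with known labels.
train_seqs = ["ATGCGCGTATATGCGCATGC", "TTATTAGCAATTAACGAATT",
              "ATGCGCGCGCATGCGCGCAT", "AATTTTAACGTTATTAATTA"]
train_labels = ["species_A", "species_B", "species_A", "species_B"]

# Represent each contig by its tetranucleotide (4-mer) counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
X_train = vectorizer.fit_transform(train_seqs)

# Train a multinomial naive Bayes classifier on the k-mer counts.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Assign an unlabeled contig to the most likely taxonomic group.
X_new = vectorizer.transform(["ATGCGCATGCGCGTATATGC"])
print(clf.predict(X_new))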
Unsupervised machine learning tools, in contrast, cluster contigs into bins without requiring prior information.
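A corresponding unsupervised sketch, again on invented toy contigs, represents each contig by its normalized tetranucleotide frequencies and groups the contigs with k-means clustering. Real binners, including those discussed next, also exploit coverage information and estimate the number of bins rather than fixing it in advance.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy contig sequences; in practice these would come from the assembly FASTA.
contigs = {"contig_1": "ATGCGCGTATATGCGCATGCATGC",
           "contig_2": "TTATTAGCAATTAACGAATTAATT",
           "contig_3": "ATGCGCGCGCATGCGCGCATATGC"}

# Tetranucleotide count matrix (one row per contig, one column per observed
# 4-mer), row-normalized so contigs of different lengths are comparable.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
counts = vectorizer.fit_transform(list(contigs.values())).toarray().astype(float)
tnf = counts / counts.sum(axis=1, keepdims=True)

# Group contigs into a fixed number of bins (two here, purely illustrative).
bins = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tnf)
for name, b in zip(contigs, bins):
    print(f"{name} -> bin {b}")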
There are several binning programs that use different algorithms. MetaBAT 2 [1] uses an adaptive binning algorithm that, unlike its previous version, does not require manual parameter tuning. Its algorithm combines several elements, including normalized tetranucleotide frequency (TNF) scores, clustering, and steps to recruit smaller contigs. Moreover, its computational efficiency has been improved compared to the previous version. MaxBin [2] uses nucleotide composition and contig abundance information to group metagenomic contigs into different bins, where each bin represents one species. The MaxBin algorithm uses tetranucleotide frequencies and scaffold coverage levels to estimate the